R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

library(ggplot2)
getwd()
## [1] "/Users/maxalekhnovich/Downloads"
redInfo <- read.csv('wineQualityReds.csv')
#just getting some basic stats on the dataset
names(redInfo)
##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
summary(redInfo)
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000
length(redInfo)
## [1] 13
# length() on a data frame returns the number of columns (13 variables); the dataset has 1599 observations
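
Because `length()` on a data frame counts columns rather than rows, `nrow()`, `ncol()`, and `dim()` are a more direct way to read off both dimensions. The following is a minimal sketch on a small made-up data frame (`toy` and its values are hypothetical, not the wine data):

```r
# Toy data frame standing in for the wine data (values are made up).
toy <- data.frame(
  pH      = c(3.2, 3.3, 3.5, 3.1),
  alcohol = c(9.4, 10.1, 11.2, 9.8),
  quality = c(5, 6, 7, 5)
)

n_rows <- nrow(toy)    # number of observations (rows)
n_cols <- ncol(toy)    # number of variables (columns)
n_len  <- length(toy)  # for a data frame this equals ncol()

dim(toy)               # rows and columns together
```

On the real data the same calls would be `nrow(redInfo)` and `ncol(redInfo)`.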

# Let's look at some distributions of the count values.
# I'll start with sulphates.

# factor(quality) gives geom_bar() discrete groups to fill by
sulphateBar <- ggplot(data = redInfo, aes(x = sulphates, fill = factor(quality))) + geom_bar()
# Restrict the x-axis to 0.25-1.0 (a left and a right limit) so we focus on the majority of values
sulphateBar <- sulphateBar + xlim(0.25, 1.0)
sulphateBar
## Warning: Removed 58 rows containing non-finite values (stat_count).
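
The warning above appears because `xlim()` discards observations that fall outside the requested range before the bars are counted. A minimal sketch of how one might count such values ahead of time; the vector `sulphates_toy` is made up for illustration and is not taken from the dataset:

```r
# Made-up sulphate values; three of them lie outside [0.25, 1.0].
sulphates_toy <- c(0.33, 0.55, 0.62, 0.73, 1.10, 2.00, 0.20)

lower <- 0.25
upper <- 1.0

# TRUE for every value that an x-axis limit of [lower, upper] would drop
outside <- sulphates_toy < lower | sulphates_toy > upper
n_dropped <- sum(outside)
n_dropped
```

On the real data the equivalent check would be `sum(redInfo$sulphates < 0.25 | redInfo$sulphates > 1.0)`. If the goal is only to zoom in, `coord_cartesian(xlim = c(0.25, 1.0))` is an alternative that keeps all rows in the computation.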

# Let's look at another distribution: the alcohol values
ggplot(data = redInfo, aes(x = alcohol, fill = factor(quality))) + geom_bar()

# Let's take the sulphates variable and plot it against another variable, such as pH

# Colour is mapped to quality once, inside aes(); geom_line() takes no fill
ggplot(data = redInfo, aes(x = sulphates, y = pH, colour = quality)) +
  geom_point(alpha = 1/5, position = position_jitter(width = 0.1, height = 0)) +
  geom_line() +
  scale_color_continuous(low = "green", high = "red")

# Let's try fixed vs. volatile acidity to see if there is a pattern related to alcohol levels
ggplot(data = redInfo, aes(x = fixed.acidity, y = volatile.acidity, colour = alcohol)) +
  geom_point() +
  geom_line() +
  scale_color_continuous(low = "green", high = "red")

# Can see that red lines connected to red points suggest a positive association between alcohol and quality, but the best wines have low volatile and fixed acidity

# Let's try a different kind of graph: residual sugar plotted against quality (points for now, not yet a boxplot)
ggplot(data = redInfo, aes(x = quality, y = residual.sugar)) + geom_point()

#summary(redInfo)
levels(redInfo$pH)
## NULL
# See how many wines there are at each quality level
ggplot(data = redInfo, aes(x = quality)) + geom_bar()

# Seems like there are a lot of wines of average quality and only a small number of low- and high-quality wines.
# Would I notice anything different if I add a weight aesthetic to this graph?
ggplot(data = redInfo, aes(x = quality, weight = alcohol)) + geom_bar()

# Interesting to note that wine qualities 5 and 6 have evened out.
# Now let's plot fixed.acidity with a facet wrap on quality to see how that affects it
ggplot(data = redInfo, aes(x = fixed.acidity)) + geom_bar(fill = "red") + facet_wrap(~quality)

# a little surprising to see that wines of really low and really high quality have a similar amount of fixed acidity

# Will try a similar plot with volatile acidity to see if the result differs from fixed.acidity
ggplot(data = redInfo, aes(x = volatile.acidity)) + geom_bar(fill = "red") + facet_wrap(~quality)

# Will there be a difference if I use density?
ggplot(data = redInfo, aes(x = density)) + geom_bar(fill = "blue") + facet_wrap(~quality)

# Wines of the highest quality fall in a specific range; let's find out what that range is

# Let's try splitting the quality into two groups: high quality (> 5) and low quality (<= 5)
highestQuality <- redInfo[which(redInfo$quality > 5), ]

lowerQuality <- redInfo[which(redInfo$quality <= 5), ]

# Set up similar variables from the quartiles of pH.
# Let's find out what the quartiles of pH are first

#summary(redInfo$pH)

# Split pH into three groups, using the first quartile (about 3.21) and the median (3.31) as cut-offs
lowPH <- redInfo[which(redInfo$pH < 3.211), ]
mediumPH <- redInfo[which((redInfo$pH >= 3.211) & (redInfo$pH <= 3.31)), ]
highPH <- redInfo[which(redInfo$pH > 3.31), ]
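
The three pH subsets can also be expressed as a single categorical column, which keeps everything in one data frame. This is only a sketch under assumptions: the `toy` values and the `pH_group` column name are hypothetical, and the cut-offs simply reuse the 3.211 and 3.31 thresholds from the subsetting code.

```r
# Made-up pH values, including the two boundary values.
toy <- data.frame(pH = c(2.9, 3.211, 3.25, 3.31, 3.4, 3.6))

# Same rules as the subsets: low is < 3.211, medium is 3.211..3.31, high is > 3.31
toy$pH_group <- factor(
  ifelse(toy$pH < 3.211, "low",
         ifelse(toy$pH <= 3.31, "medium", "high")),
  levels = c("low", "medium", "high")
)

table(toy$pH_group)
```

With a column like this, a single plot with `facet_wrap(~pH_group)` could stand in for three separately built plots.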

redInfo$quality <- factor(redInfo$quality,
                           labels = c(3, 4, 5, 6, 7,8))

# Residual sugar vs. chlorides within each pH group
lowPhResidualChlorides <- ggplot(data = lowPH, aes(x = residual.sugar, y = chlorides)) + geom_point(alpha = 1/10)
mediumPhResidualChlorides <- ggplot(data = mediumPH, aes(x = residual.sugar, y = chlorides)) + geom_point(alpha = 1/10)
highPhResidualChlorides <- ggplot(data = highPH, aes(x = residual.sugar, y = chlorides)) + geom_point(alpha = 1/10)
# Let's show them all together, stacked in one figure (similar in spirit to a facet wrap)
#lowPhResidualChlorides
library(gridExtra)

grid.arrange(lowPhResidualChlorides, mediumPhResidualChlorides, highPhResidualChlorides)

# Let's try a different pairing: citric acid vs. total sulfur dioxide
# (geom_point() has no width parameter, so it is left out)
lowPhcitricTotal <- ggplot(data = lowPH, aes(x = citric.acid, y = total.sulfur.dioxide)) + geom_point(alpha = 1/10)
mediumPhcitricTotal <- ggplot(data = mediumPH, aes(x = citric.acid, y = total.sulfur.dioxide)) + geom_point(alpha = 1/10)
highPhcitricTotal <- ggplot(data = highPH, aes(x = citric.acid, y = total.sulfur.dioxide)) + geom_point(alpha = 1/10)
grid.arrange(lowPhcitricTotal, mediumPhcitricTotal, highPhcitricTotal)

# You can see that the high-pH wines have very little citric acid and tend to have higher total sulfur dioxide

# Let's map quality to colour to see whether the pattern relates to quality
lowPhcitricTotal <- lowPhcitricTotal + geom_jitter(aes(colour = quality))
mediumPhcitricTotal <- mediumPhcitricTotal + geom_jitter(aes(colour = quality))
highPhcitricTotal <- highPhcitricTotal + geom_jitter(aes(colour = quality))
grid.arrange(lowPhcitricTotal, mediumPhcitricTotal, highPhcitricTotal)

# Some overplotting exists; let's try it with an alpha value
highQualitySugarVchloride <- ggplot(data = highestQuality, aes(x = residual.sugar, y = chlorides)) + geom_point(alpha = 1/4)


# Let's try the same graph as before but with the lowerQuality variable we created
lowerQualitySugarVChloride <- ggplot(data = lowerQuality, aes(x = residual.sugar, y = chlorides)) + geom_point(alpha = 1/4)
# Slightly lower chlorides than in the higher-quality wines

# Let's combine the two graphs to see if the differences are more noticeable,
# scaling ylim and xlim to remove outliers
highQualitySugarVchloride <- highQualitySugarVchloride + ylim(0, 0.4) + xlim(0, 11)
lowerQualitySugarVChloride <- lowerQualitySugarVChloride + ylim(0, 0.4) + xlim(0, 11)
grid.arrange(highQualitySugarVchloride, lowerQualitySugarVChloride)
## Warning: Removed 7 rows containing missing values (geom_point).
## Warning: Removed 14 rows containing missing values (geom_point).

# Let's try a similar graph with other variables,
# in this case citric acid vs. free sulfur dioxide
highQualityCitricVAlcohol <- ggplot(data = highestQuality, aes(x = citric.acid, y = free.sulfur.dioxide)) + geom_point(alpha = 1/4)


# Let's try the same graph as before but with the lowerQuality variable we created
lowerQualityQualityCitricVAlcohol <- ggplot(data = lowerQuality, aes(x = citric.acid, y = free.sulfur.dioxide)) + geom_point(alpha = 1/4)

#grid.arrange(highQualityCitricVAlcohol, lowerQualityQualityCitricVAlcohol)

# Let's try again with a lower alpha (more transparency)
highQualityCitricVAlcohol <- ggplot(data = highestQuality, aes(x = citric.acid, y = free.sulfur.dioxide)) + geom_point(alpha = 1/8)

# Let's also add an xlim so the graphs have the same limits, disregarding outliers close to 1.0
lowerQualityQualityCitricVAlcohol <- ggplot(data = lowerQuality, aes(x = citric.acid, y = free.sulfur.dioxide)) + geom_point(alpha = 1/8)
lowerQualityQualityCitricVAlcohol <- lowerQualityQualityCitricVAlcohol + xlim(0, 0.8)

grid.arrange(highQualityCitricVAlcohol, lowerQualityQualityCitricVAlcohol)
## Warning: Removed 1 rows containing missing values (geom_point).

# The graphs are somewhat similar, but there is a heavier cluster near (0.5, 10) in the first graph that is not present in the bottom one


# Let's examine density vs. total sulfur dioxide
# and make the colours stand out.
# quality is a factor by this point, so convert it back to numbers for the continuous colour scale
plot1 <- ggplot(data = redInfo, aes(x = density, y = total.sulfur.dioxide, color = as.numeric(as.character(quality)))) + scale_color_continuous(low = "blue", high = "red") + geom_point()
# Let's try the same graph with a facet wrap on quality rather than colour,
# and change the colour variable to represent free sulfur dioxide
ggplot(data = redInfo, aes(x = density, y = total.sulfur.dioxide, color = free.sulfur.dioxide)) + scale_color_continuous(low = "blue", high = "red") + geom_point() + facet_wrap(~quality)

# Let's try a different combination with a similar colour scheme: citric acid vs. residual sugar, with colour mapped to alcohol
ggplot(data = redInfo, aes(x = citric.acid, y = residual.sugar, color = alcohol)) + scale_color_continuous(low = "black", high = "yellow") + geom_point() + facet_wrap(~quality)

# The graph suggests that wines of a given quality fall within a particular range of both residual sugar and alcohol.
# Really poor-quality wines also have either extremely low citric acid or a lot of citric acid, based on the graph
 
 
 
 
# Let's try a facet wrap on a different variable, such as residual sugar
badplot <- ggplot(data = redInfo, aes(x = density, y = total.sulfur.dioxide)) + facet_wrap(~residual.sugar)
# Obviously that's not going to work because there are too many unique residual sugar values

# Let's try again by factoring the residual sugar values based on the four quartiles
summary(redInfo$residual.sugar)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500
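
The quartile-based factoring of residual sugar is described but not carried out in the code that follows. One way it could be done is with `quantile()` and `cut()`. This is a sketch on made-up values; the `toy` data frame and the `sugar_quartile` column name are hypothetical.

```r
# Made-up residual sugar values (not the wine data).
toy <- data.frame(
  residual.sugar = c(0.9, 1.5, 1.9, 2.1, 2.2, 2.4, 2.6, 3.0, 5.5, 15.5)
)

# Minimum, three quartiles, and maximum serve as bin edges
breaks <- quantile(toy$residual.sugar, probs = c(0, 0.25, 0.5, 0.75, 1))

# include.lowest = TRUE keeps the minimum value inside the first bin
toy$sugar_quartile <- cut(
  toy$residual.sugar,
  breaks = breaks,
  labels = c("Q1", "Q2", "Q3", "Q4"),
  include.lowest = TRUE
)

table(toy$sugar_quartile)
```

A factor like this has only four levels, so it would be usable in `facet_wrap()` where the raw values are not. Note that `cut()` fails if two of the break values coincide, which can happen when a variable has many tied values.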
redInfo$quality <- factor(redInfo$quality,
                           labels = c(3, 4, 5, 6, 7,8))
ggplot(redInfo, aes(x = quality, y = residual.sugar)) + geom_boxplot(fill = 'purple', colour = 'orange', alpha = 0.7) + scale_y_continuous(name = 'residual sugar', breaks = seq(2, 10, .5))

# Interesting to note that there are very few outliers below Q1.
# Also, after adjusting the breaks, it is easy to see that wines of a higher quality on average have a residual sugar between 2.5 and 3


# Let's try total sulfur dioxide vs. quality in a similar boxplot
ggplot(redInfo, aes(x = quality, y = total.sulfur.dioxide)) + geom_boxplot(fill = 'purple', colour = 'yellow', alpha = 0.7) + scale_y_continuous(name = 'total sulfur dioxide')

More Exploration

## [1] 3 8
## [1] 2.74 4.01

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

FINAL PLOTS

## [1] 3 8
## [1] 2.74 4.01

## [1]   6 289
## [1]   6 165
## [1]   7 160
## Warning: Ignoring unknown parameters: width

CONCLUSION

After analyzing the dataset of 1599 wines based on the 13 variables provided, I came to some conclusions. To start, the quality of the wine is heavily impacted by its pH level. Wines of lower quality tend to have a higher pH than wines of higher quality. Also, wines of higher quality tend to have neither too much nor too little alcohol. They also did not have too much residual sugar. Wines that are very poor tend to have a very small amount of alcohol. Wines that have a larger amount of alcohol are most often average.

I also observed a relationship between pH, citric acid, and total sulfur dioxide. The wines that had a high pH had a citric acid value of zero or close to zero. Citric acid for low-pH wines was about 0.5, while medium-pH wines were less distinct and had a much wider range, on average.

In conclusion, I learned that wines of higher quality tend to have a good balance of significant factors such as pH, residual sugar, and alcohol levels. Most often, when wines had a significantly high or low value for these factors, the quality of the wine was not very high.

I wish there had been more variables covering other factors, such as price or sales. I think it would have made the project more interesting to examine the different wines at different price points, especially in comparison to the quality of the wines. Also, I initially had some issues trying to categorize the data with some of the variables, considering their ranges were very large. The quality variable was perfect in the sense that its range was only between three and eight. I overcame this by making other categorical variables, such as "lowPH", "mediumPH", and "highPH", that I could use to look at different aspects of the data as well.
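
Observations like these can be checked numerically as well as visually. The following is a minimal sketch, assuming a toy data frame with made-up values (it says nothing about the actual wine data), of summarising a variable by quality level with `tapply()` and measuring a linear association with `cor()`:

```r
# Made-up values for illustration only.
toy <- data.frame(
  quality     = c(3, 5, 5, 6, 6, 7, 8),
  pH          = c(3.6, 3.4, 3.3, 3.3, 3.2, 3.2, 3.1),
  citric.acid = c(0.00, 0.10, 0.20, 0.25, 0.30, 0.40, 0.50)
)

# Median pH within each quality level
median_pH_by_quality <- tapply(toy$pH, toy$quality, median)

# Pearson correlation between pH and citric acid
pH_citric_cor <- cor(toy$pH, toy$citric.acid)

median_pH_by_quality
pH_citric_cor
```

On the real data, the same calls with `redInfo$pH`, `redInfo$quality`, and `redInfo$citric.acid` would show whether the patterns seen in the plots hold up as summary statistics.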